{ "cells": [ { "cell_type": "code", "execution_count": 21, "id": "76cd10db", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.metrics import accuracy_score, roc_auc_score" ] }, { "cell_type": "code", "execution_count": 22, "id": "e9c4b3af", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table>\n", "  <thead>\n",
"    <tr><th></th><th>Loan_ID</th><th>Gender</th><th>Married</th><th>Dependents</th><th>Education</th><th>Self_Employed</th><th>ApplicantIncome</th><th>CoapplicantIncome</th><th>LoanAmount</th><th>Loan_Amount_Term</th><th>Credit_History</th><th>Property_Area</th><th>Loan_Status</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>LP001002</td><td>Male</td><td>No</td><td>0</td><td>Graduate</td><td>No</td><td>5849</td><td>0.0</td><td>NaN</td><td>360.0</td><td>1.0</td><td>Urban</td><td>Y</td></tr>\n",
"    <tr><th>1</th><td>LP001003</td><td>Male</td><td>Yes</td><td>1</td><td>Graduate</td><td>No</td><td>4583</td><td>1508.0</td><td>128.0</td><td>360.0</td><td>1.0</td><td>Rural</td><td>N</td></tr>\n",
"    <tr><th>2</th><td>LP001005</td><td>Male</td><td>Yes</td><td>0</td><td>Graduate</td><td>Yes</td><td>3000</td><td>0.0</td><td>66.0</td><td>360.0</td><td>1.0</td><td>Urban</td><td>Y</td></tr>\n",
"    <tr><th>3</th><td>LP001006</td><td>Male</td><td>Yes</td><td>0</td><td>Not Graduate</td><td>No</td><td>2583</td><td>2358.0</td><td>120.0</td><td>360.0</td><td>1.0</td><td>Urban</td><td>Y</td></tr>\n",
"    <tr><th>4</th><td>LP001008</td><td>Male</td><td>No</td><td>0</td><td>Graduate</td><td>No</td><td>6000</td><td>0.0</td><td>141.0</td><td>360.0</td><td>1.0</td><td>Urban</td><td>Y</td></tr>\n",
"  </tbody>\n", "</table>\n", "
" ], "text/plain": [ " Loan_ID Gender Married Dependents Education Self_Employed \\\n", "0 LP001002 Male No 0 Graduate No \n", "1 LP001003 Male Yes 1 Graduate No \n", "2 LP001005 Male Yes 0 Graduate Yes \n", "3 LP001006 Male Yes 0 Not Graduate No \n", "4 LP001008 Male No 0 Graduate No \n", "\n", " ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \\\n", "0 5849 0.0 NaN 360.0 \n", "1 4583 1508.0 128.0 360.0 \n", "2 3000 0.0 66.0 360.0 \n", "3 2583 2358.0 120.0 360.0 \n", "4 6000 0.0 141.0 360.0 \n", "\n", " Credit_History Property_Area Loan_Status \n", "0 1.0 Urban Y \n", "1 1.0 Rural N \n", "2 1.0 Urban Y \n", "3 1.0 Urban Y \n", "4 1.0 Urban Y " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#import data\n", "df = pd.read_csv('loan_data.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "id": "d3122ccf", "metadata": {}, "source": [ "### Missing Values" ] }, { "cell_type": "code", "execution_count": 23, "id": "cae1ab70", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Missing values in each column:\n", " Loan_ID 0\n", "Gender 13\n", "Married 3\n", "Dependents 15\n", "Education 0\n", "Self_Employed 32\n", "ApplicantIncome 0\n", "CoapplicantIncome 0\n", "LoanAmount 22\n", "Loan_Amount_Term 14\n", "Credit_History 50\n", "Property_Area 0\n", "Loan_Status 0\n", "dtype: int64\n" ] } ], "source": [ "#check for missing values\n", "missing_values = df.isnull().sum()\n", "print(\"Missing values in each column:\\n\", missing_values)" ] }, { "cell_type": "code", "execution_count": 24, "id": "1d307a2f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Missing values after imputation:\n", " Loan_ID 0\n", "Gender 0\n", "Married 0\n", "Dependents 0\n", "Education 0\n", "Self_Employed 0\n", "ApplicantIncome 0\n", "CoapplicantIncome 0\n", "LoanAmount 0\n", "Loan_Amount_Term 0\n", "Credit_History 0\n", "Property_Area 0\n", "Loan_Status 0\n", "dtype: 
int64\n" ] } ], "source": [ "#impute missing values\n", "#categorical columns, plus Loan_Amount_Term (which takes only a few fixed values), are filled with the mode\n", "categorical_columns = ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Loan_Amount_Term']\n", "for column in categorical_columns:\n", "    df[column] = df[column].fillna(df[column].mode()[0])\n", "\n", "df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])\n", "\n", "#LoanAmount is continuous, so it is filled with the median\n", "df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())\n", "\n", "#double check missing values\n", "missing_values_post_imputation = df.isnull().sum()\n", "print(\"Missing values after imputation:\\n\", missing_values_post_imputation)" ] }, { "cell_type": "markdown", "id": "f89dd152", "metadata": {}, "source": [ "#### Justification for Imputing Missing Values" ] }, { "cell_type": "markdown", "id": "adb887c4", "metadata": {}, "source": [ "Since this assignment requires making loan predictions, I believe that imputing missing values is preferable to dropping rows with missing values entirely. The number of missing values in each column is low relative to the size of the dataset, so imputing keeps those rows available for training, which should help the predictions. I chose to impute categorical columns (Gender, Married, Dependents, etc.) with their most common value (the mode), as this maintains the typical distribution of these features. For numerical data like LoanAmount, I used median imputation, as the median is less sensitive to outliers than the mean." ] }, { "cell_type": "markdown", "id": "59993cd9", "metadata": {}, "source": [ "### Checking for Outliers" ] }, { "cell_type": "code", "execution_count": 6, "id": "cc4f4ceb", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#identifying outliers using IQR\n", "def identify_outliers(column):\n", " Q1 = df[column].quantile(0.25)\n", " Q3 = df[column].quantile(0.75)\n", " IQR = Q3 - Q1\n", " lower_bound = Q1 - 1.5 * IQR\n", " upper_bound = Q3 + 1.5 * IQR\n", " outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]\n", " return outliers\n", "\n", "#create graphs for visualization\n", "plt.figure(figsize=(12, 6))\n", "for i, column in enumerate(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']):\n", " plt.subplot(1, 3, i + 1)\n", " sns.boxplot(data=df, y=column)\n", " plt.title(f'Box Plot of {column}')\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "07140005", "metadata": {}, "source": [ "As we can see from the above box plots, there are a number of outliers in the numerical columns." ] }, { "cell_type": "code", "execution_count": 7, "id": "0f7a4574", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Outliers in ApplicantIncome:\n", "Index: 9, ApplicantIncome: 12841\n", "Index: 34, ApplicantIncome: 12500\n", "Index: 54, ApplicantIncome: 11500\n", "Index: 67, ApplicantIncome: 10750\n", "Index: 102, ApplicantIncome: 13650\n", "Index: 106, ApplicantIncome: 11417\n", "Index: 115, ApplicantIncome: 14583\n", "Index: 119, ApplicantIncome: 10408\n", "Index: 126, ApplicantIncome: 23803\n", "Index: 128, ApplicantIncome: 10513\n", "Index: 130, ApplicantIncome: 20166\n", "Index: 138, ApplicantIncome: 14999\n", "Index: 144, ApplicantIncome: 11757\n", "Index: 146, ApplicantIncome: 14866\n", "Index: 155, ApplicantIncome: 39999\n", "Index: 171, ApplicantIncome: 51763\n", "Index: 183, ApplicantIncome: 33846\n", "Index: 185, ApplicantIncome: 39147\n", "Index: 191, ApplicantIncome: 12000\n", "Index: 199, ApplicantIncome: 11000\n", "Index: 254, ApplicantIncome: 16250\n", "Index: 258, ApplicantIncome: 14683\n", "Index: 271, 
ApplicantIncome: 11146\n", "Index: 278, ApplicantIncome: 14583\n", "Index: 284, ApplicantIncome: 20667\n", "Index: 308, ApplicantIncome: 20233\n", "Index: 324, ApplicantIncome: 15000\n", "Index: 333, ApplicantIncome: 63337\n", "Index: 369, ApplicantIncome: 19730\n", "Index: 370, ApplicantIncome: 15759\n", "Index: 409, ApplicantIncome: 81000\n", "Index: 424, ApplicantIncome: 14880\n", "Index: 432, ApplicantIncome: 12876\n", "Index: 438, ApplicantIncome: 10416\n", "Index: 443, ApplicantIncome: 37719\n", "Index: 467, ApplicantIncome: 16692\n", "Index: 475, ApplicantIncome: 16525\n", "Index: 478, ApplicantIncome: 16667\n", "Index: 483, ApplicantIncome: 10833\n", "Index: 487, ApplicantIncome: 18333\n", "Index: 493, ApplicantIncome: 17263\n", "Index: 506, ApplicantIncome: 20833\n", "Index: 509, ApplicantIncome: 13262\n", "Index: 525, ApplicantIncome: 17500\n", "Index: 533, ApplicantIncome: 11250\n", "Index: 534, ApplicantIncome: 18165\n", "Index: 561, ApplicantIncome: 19484\n", "Index: 572, ApplicantIncome: 16666\n", "Index: 594, ApplicantIncome: 16120\n", "Index: 604, ApplicantIncome: 12000\n", "\n", "\n", "Outliers in CoapplicantIncome:\n", "Index: 9, CoapplicantIncome: 10968.0\n", "Index: 12, CoapplicantIncome: 8106.0\n", "Index: 38, CoapplicantIncome: 7210.0\n", "Index: 122, CoapplicantIncome: 8980.0\n", "Index: 135, CoapplicantIncome: 7750.0\n", "Index: 177, CoapplicantIncome: 11300.0\n", "Index: 180, CoapplicantIncome: 7250.0\n", "Index: 253, CoapplicantIncome: 7101.0\n", "Index: 349, CoapplicantIncome: 6250.0\n", "Index: 372, CoapplicantIncome: 7873.0\n", "Index: 402, CoapplicantIncome: 20000.0\n", "Index: 417, CoapplicantIncome: 20000.0\n", "Index: 444, CoapplicantIncome: 8333.0\n", "Index: 506, CoapplicantIncome: 6667.0\n", "Index: 513, CoapplicantIncome: 6666.0\n", "Index: 523, CoapplicantIncome: 7166.0\n", "Index: 581, CoapplicantIncome: 33837.0\n", "Index: 600, CoapplicantIncome: 41667.0\n", "\n", "\n", "Outliers in LoanAmount:\n", "Index: 5, LoanAmount: 
267.0\n", "Index: 9, LoanAmount: 349.0\n", "Index: 21, LoanAmount: 315.0\n", "Index: 34, LoanAmount: 320.0\n", "Index: 54, LoanAmount: 286.0\n", "Index: 67, LoanAmount: 312.0\n", "Index: 83, LoanAmount: 265.0\n", "Index: 126, LoanAmount: 370.0\n", "Index: 130, LoanAmount: 650.0\n", "Index: 135, LoanAmount: 290.0\n", "Index: 155, LoanAmount: 600.0\n", "Index: 161, LoanAmount: 275.0\n", "Index: 171, LoanAmount: 700.0\n", "Index: 177, LoanAmount: 495.0\n", "Index: 233, LoanAmount: 280.0\n", "Index: 253, LoanAmount: 279.0\n", "Index: 258, LoanAmount: 304.0\n", "Index: 260, LoanAmount: 330.0\n", "Index: 278, LoanAmount: 436.0\n", "Index: 308, LoanAmount: 480.0\n", "Index: 324, LoanAmount: 300.0\n", "Index: 325, LoanAmount: 376.0\n", "Index: 333, LoanAmount: 490.0\n", "Index: 351, LoanAmount: 308.0\n", "Index: 369, LoanAmount: 570.0\n", "Index: 372, LoanAmount: 380.0\n", "Index: 381, LoanAmount: 296.0\n", "Index: 391, LoanAmount: 275.0\n", "Index: 409, LoanAmount: 360.0\n", "Index: 432, LoanAmount: 405.0\n", "Index: 487, LoanAmount: 500.0\n", "Index: 506, LoanAmount: 480.0\n", "Index: 514, LoanAmount: 311.0\n", "Index: 523, LoanAmount: 480.0\n", "Index: 525, LoanAmount: 400.0\n", "Index: 536, LoanAmount: 324.0\n", "Index: 561, LoanAmount: 600.0\n", "Index: 572, LoanAmount: 275.0\n", "Index: 592, LoanAmount: 292.0\n", "Index: 600, LoanAmount: 350.0\n", "Index: 604, LoanAmount: 496.0\n", "\n", "\n" ] } ], "source": [ "#shows the outliers in each column and the index of the row.\n", "def print_outliers(column):\n", " outliers = identify_outliers(column)\n", " print(f\"Outliers in {column}:\")\n", " for idx, value in outliers[column].items():\n", " print(f\"Index: {idx}, {column}: {value}\")\n", " print(\"\\n\")\n", "\n", "print_outliers('ApplicantIncome')\n", "print_outliers('CoapplicantIncome')\n", "print_outliers('LoanAmount')" ] }, { "cell_type": "markdown", "id": "4360fa74", "metadata": {}, "source": [ "As we can see in the above printout and graphs, there are quite a 
large number of outliers. However, I believe the correct option is to keep them in the dataset. The three columns checked are ApplicantIncome, CoapplicantIncome, and LoanAmount, and in the real world all three vary widely. Applicants with different incomes apply for loans of different sizes, so a wide range of values is expected. Keeping the outliers therefore gives a more realistic view of the differences in loan applications and approvals." ] }, { "cell_type": "markdown", "id": "0127a19e", "metadata": {}, "source": [ "### Discretizing ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term" ] }, { "cell_type": "code", "execution_count": 8, "id": "ba4175cd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term\n", "0 Low Income No Co-Applicant Small Loan Long Term\n", "1 Low Income Low Income Small Loan Long Term\n", "2 Low Income No Co-Applicant Small Loan Long Term\n", "3 Low Income Low Income Small Loan Long Term\n", "4 Low Income No Co-Applicant Small Loan Long Term\n" ] } ], "source": [ "#discretizing columns (pd.cut bins are right-closed, e.g. 30000 falls in 'Low Income')\n", "df['ApplicantIncome'] = pd.cut(df['ApplicantIncome'],\n", "    bins=[0, 30000, 80000, float('inf')],\n", "    labels=['Low Income', 'Medium Income', 'High Income'])\n", "\n", "#the -1 lower edge puts an income of exactly 0 in 'No Co-Applicant'\n", "df['CoapplicantIncome'] = pd.cut(df['CoapplicantIncome'],\n", "    bins=[-1, 0, 15000, 40000, float('inf')],\n", "    labels=['No Co-Applicant', 'Low Income', 'Medium Income', 'High Income'])\n", "\n", "df['LoanAmount'] = pd.cut(df['LoanAmount'],\n", "    bins=[0, 150, 300, float('inf')],\n", "    labels=['Small Loan', 'Medium Loan', 'Large Loan'])\n", "\n", "df['Loan_Amount_Term'] = pd.cut(df['Loan_Amount_Term'],\n", "    bins=[0, 120, 240, float('inf')],\n", "    labels=['Short Term', 'Medium Term', 'Long Term'])\n", "\n", "#check changes\n", 
"print(df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']].head())" ] }, { "cell_type": "markdown", "id": "f00573e8", "metadata": {}, "source": [ "For each of the above columns, I divided the values into a few categories. For ApplicantIncome, I chose Low Income (0 to 30,000), Medium Income (30,001 to 80,000), and High Income (80,001 and above). This was done to roughly imitate income levels in the real world while taking the typical values in this dataset into account.\n", "\n", "For CoapplicantIncome, I did effectively the same thing. However, I also added a label for rows with 0 income, which I believe means that there isn't a coapplicant present.\n", "\n", "For LoanAmount, the divisions between small, medium, and large loans were planned around the column average of 146.41: the cutoffs of 150 and 300 sit at roughly one and two times the average loan.\n", "\n", "Lastly, Loan_Amount_Term was handled a little differently from the other columns. The terms take only a few set values: 84, 120, 180, 240, 360, and 480. Because of this, I grouped the loans into Short (0-120), Medium (121-240), and Long (above 240)."
] }, { "cell_type": "code", "execution_count": 9, "id": "33ca2d8a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Selected Features for Predictors:\n", "['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']\n", " ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term\n", "0 Low Income No Co-Applicant Small Loan Long Term\n", "1 Low Income Low Income Small Loan Long Term\n", "2 Low Income No Co-Applicant Small Loan Long Term\n", "3 Low Income Low Income Small Loan Long Term\n", "4 Low Income No Co-Applicant Small Loan Long Term\n" ] } ], "source": [ "all_columns = df.columns\n", "\n", "#keep only the four financial features as predictors ('Loan_Status' is the target)\n", "predictors = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']\n", "\n", "print(\"Selected Features for Predictors:\")\n", "print(predictors)\n", "\n", "df_selected = df[predictors]\n", "\n", "print(df_selected.head())" ] }, { "cell_type": "markdown", "id": "c4286a38", "metadata": {}, "source": [ "For predicting loan approval, I focused on features that are directly related to an applicant's financial situation: ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term. I believe that these four bear directly on loan prediction, whereas other features like Gender, Married, Dependents, and Education are personal qualities. While these features do impact loan approval in the real world, for my models I wanted to focus specifically on financial status."
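, "\n", "\n", "The four selected predictors are ordered categories once binned. The following is a minimal sketch, using hypothetical loan amounts rather than rows from loan_data.csv, of one way integer codes that keep the bin order could be derived in pandas. It is an illustrative alternative, not the encoding used in this notebook; an alphabetical encoder such as scikit-learn's LabelEncoder would instead map Large Loan to 0 and Small Loan to 2.

```python
import pandas as pd

# Hypothetical loan amounts, not taken from loan_data.csv.
amounts = pd.Series([66.0, 128.0, 267.0, 650.0])

binned = pd.cut(amounts,
                bins=[0, 150, 300, float('inf')],
                labels=['Small Loan', 'Medium Loan', 'Large Loan'])

# pd.cut returns an ordered categorical, so .cat.codes follows bin order:
# Small Loan -> 0, Medium Loan -> 1, Large Loan -> 2.
ordinal_codes = binned.cat.codes

print(ordinal_codes.tolist())
```

Keeping Small < Medium < Large in the codes matters mainly for models that treat the codes as magnitudes or distances, such as Logistic Regression and K-Nearest Neighbors."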
] }, { "cell_type": "code", "execution_count": 10, "id": "60c8d815", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \\\n", "0 1 3 2 0 \n", "1 1 1 2 0 \n", "2 1 3 2 0 \n", "3 1 1 2 0 \n", "4 1 3 2 0 \n", "\n", " Loan_Status \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 1 \n", "4 1 \n" ] } ], "source": [ "#encode the binned features and the target as integers\n", "#(LabelEncoder assigns codes in alphabetical order of the labels)\n", "label_encoder = LabelEncoder()\n", "\n", "categorical_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Loan_Status']\n", "\n", "for col in categorical_columns:\n", "    df[col] = label_encoder.fit_transform(df[col])\n", "\n", "print(df[categorical_columns].head())" ] }, { "cell_type": "markdown", "id": "fea9779d", "metadata": {}, "source": [ "### Training Models" ] }, { "cell_type": "code", "execution_count": 14, "id": "3afb8f90", "metadata": {}, "outputs": [], "source": [ "#focusing on just ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term\n", "X = df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']]\n", "y = df['Loan_Status']\n", "\n", "#splitting data: 80% train, 20% test\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" ] }, { "cell_type": "code", "execution_count": 15, "id": "2f50ef72", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Naive Bayes:\n", " Accuracy: 0.6341\n", " AUC Score: 0.5445\n" ] } ], "source": [ "#training model\n", "nb_model = GaussianNB()\n", "nb_model.fit(X_train, y_train)\n", "\n", "#predicting and evaluating (AUC uses the predicted probability of class 1)\n", "y_pred_nb = nb_model.predict(X_test)\n", "accuracy_nb = accuracy_score(y_test, y_pred_nb)\n", "auc_nb = roc_auc_score(y_test, nb_model.predict_proba(X_test)[:, 1])\n", "\n", "print(\"Naive Bayes:\")\n", "print(f\" Accuracy: {accuracy_nb:.4f}\")\n", "print(f\" AUC Score: {auc_nb:.4f}\")" ] }, { "cell_type": "code", "execution_count": 16, "id": "c2e142b5", "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Logistic Regression:\n", " Accuracy: 0.6504\n", " AUC Score: 0.5576\n" ] } ], "source": [ "#training model\n", "lr_model = LogisticRegression()\n", "lr_model.fit(X_train, y_train)\n", "\n", "#predicting and evaluating (AUC uses the predicted probability of class 1)\n", "y_pred_lr = lr_model.predict(X_test)\n", "accuracy_lr = accuracy_score(y_test, y_pred_lr)\n", "auc_lr = roc_auc_score(y_test, lr_model.predict_proba(X_test)[:, 1])\n", "\n", "print(\"Logistic Regression:\")\n", "print(f\" Accuracy: {accuracy_lr:.4f}\")\n", "print(f\" AUC Score: {auc_lr:.4f}\")" ] }, { "cell_type": "code", "execution_count": 17, "id": "4b067d44", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "K-Nearest Neighbors:\n", " Accuracy: 0.5854\n", " AUC Score: 0.5888\n" ] } ], "source": [ "#training model\n", "knn_model = KNeighborsClassifier(n_neighbors=5)\n", "knn_model.fit(X_train, y_train)\n", "\n", "#predicting and evaluating (AUC uses the predicted probability of class 1)\n", "y_pred_knn = knn_model.predict(X_test)\n", "accuracy_knn = accuracy_score(y_test, y_pred_knn)\n", "auc_knn = roc_auc_score(y_test, knn_model.predict_proba(X_test)[:, 1])\n", "\n", "print(\"K-Nearest Neighbors:\")\n", "print(f\" Accuracy: {accuracy_knn:.4f}\")\n", "print(f\" AUC Score: {auc_knn:.4f}\")" ] }, { "cell_type": "markdown", "id": "7beca591", "metadata": {}, "source": [ "I chose Naive Bayes, Logistic Regression, and K-Nearest Neighbors to cover a variety of modeling approaches. Naive Bayes is a probabilistic model, Logistic Regression is a linear model, and K-Nearest Neighbors is a non-linear, instance-based model. Although each model has weaknesses, most of them matter little for this dataset and are outweighed by the models' strengths. For instance, Logistic Regression can become unwieldy with a large number of categorical features, but that doesn't apply here (we're only looking at 4 of the features). 
Additionally, each of the models is easy to train and understand." ] }, { "cell_type": "markdown", "id": "1faae0eb", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "id": "a6fb0506", "metadata": {}, "source": [ "Among the three models, Logistic Regression achieved the highest accuracy (0.6504) with an AUC score of 0.5576. K-Nearest Neighbors (KNN) had the highest AUC (0.5888), but its accuracy was the lowest (0.5854), meaning it may not be as reliable for predictions on this dataset. Naive Bayes, with an accuracy of 0.6341 and the lowest AUC (0.5445), also underperformed compared to Logistic Regression. All three AUC scores sit only slightly above 0.5, the level of random guessing, so none of the models separates approved and rejected applications strongly using these four features." ] }, { "cell_type": "markdown", "id": "d7f7f363", "metadata": {}, "source": [ "Logistic Regression seemed to be the most effective model for predicting loan approval status, as it returned the highest accuracy while maintaining a comparable AUC score. The strength of this model lies in its simplicity, which helps it generalize without overfitting. Logistic Regression performed better than Naive Bayes, which may have been held back by its assumption that features are independent, and better than K-Nearest Neighbors, which is sensitive to the data distribution and to potential outliers, a likely cause of its lower accuracy.\n", "\n", "While KNN had a somewhat higher AUC score, its overall accuracy was lower. In contrast, Logistic Regression provides a good balance between the two metrics. Given this balance and its leading accuracy, I think that Logistic Regression is the best of the three models for this task." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }